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Abstract 

We consider moment matching techniques for estimation in latent Dirichlet allo¬ 
cation (LDA). By drawing explicit links between LDA and discrete versions of 
independent component analysis (ICA), we first derive a new set of cumulant- 
based tensors, with an improved sample complexity. Moreover, we reuse standard 
ICA techniques such as joint diagonalization of tensors to improve over existing 
methods based on the tensor power method. In an extensive set of experiments on 
both synthetic and real datasets, we show that our new combination of tensors and 
orthogonal joint diagonalization techniques outperforms existing moment match¬ 
ing methods. 


1 Introduction 


Topic models have emerged as flexible and important tools for the modelisation of text corpora. 
While early work has focused on graphical-model approximate inference techniques such as varia¬ 
tional inference [ 1] or Gibbs sampling [2], tensor-based moment matching techniques have recently 
emerged as strong competitors due to their computational speed and theoretical guarantees [3, 4]. 
In this paper, we draw explicit links with the independent component analysis (ICA) literature 
(e.g., [5] and references therein) by showing a strong relationship between latent Dirichlet allocation 
(LDA) [1] and ICA [6, 7, 8]. We can then reuse standard ICA techniques and results, and derive new 
tensors with better sample complexity and new algorithms based on joint diagonalization. 


2 Is LDA discrete PC A or discrete ICA? 

Notation. Following the text modeling terminology, we define a corpus X = {xi ,..., xat} as a 
collection of N documents. Each document is a collection {wni, ■ ■ ■, WuLr^} of Ln tokens. It is 
convenient to represent the f-th token of the n-th document as a 1-of-M encoding with an indicator 
vector Wnt C {0,1}^ with only one non-zero, where M is the vocabulary size, and each document 
as the count vector := C . In such representation, the length of the n-th 

document is Xnm- We will always use the index k G AT} to refer to topics, the 

index n G {l,...,iV}to refer to documents, the index m G {l,...,M}to refer to words from 
the vocabulary, and the index £ G {1,..., L„} to refer to tokens of the n-th document. The plate 
diagrams of the models from this section are presented in Appendix A. 

Latent Dirichlet allocation [1] is a generative probabilistic model for discrete data such as text 
corpora. In accordance to this model, the n-th document is modeled as an admixture over the vo¬ 
cabulary of M words with K latent topics. Specifically, the latent variable which is sampled 
from the Dirichlet distribution, represents the topic mixture proportion over K topics for the n-th 
document. Given 9n, the topic choice Zne \ 9n for the ^-th token is sampled from the multinomial dis¬ 
tribution with the probability vector . The token Wni \ Zni , is then sampled from the multinomial 
distribution with the probability vector dz^f, or dk if k is the index of the non-zero element in Zni- 
This vector dk is the fc-th topic, that is a vector of probabilities over the words from the vocabulary 
subject to the simplex constraint, i.e., dk G Am, where Am := {d G : d y 0, dm = !}■ 
This generative process of a document (the index n is omitted for simplicity) can be summarized as 
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9 ^ Dirichlet(c), 

^ Multinomial(l, 0), (1) 

wt,\zi,9 ~ Multinomial( 1,cijj). 

One can think of the latent variables zg as auxiliary variables which were introduced for convenience 
of inference, but can in fact be marginalized out [9], which leads to the following model 


9 ^ Dirichlet(c), 
x\9 ^ Multinomial(L, i90), 


LDA model (2) 


where D G is the topic matrix with the fc-th column equal to the fc-th topic dk, and c G M++ 

is the vector of parameters for the Dirichlet distribution. While a document is represented as a set 
of tokens we, in the formulation (1), the formulation (2) instead compactly represents a document as 
the count vector x. Although the two representations are equivalent, we focus on the second one in 
this paper and therefore refer to it as the LDA model. 

Importantly, the LDA model does not model the length of documents. Indeed, although the original 
paper [1] proposes to model the document length as L| A ^ Poisson(A), this is never used in practice 
and, in particular, the parameter A is not learned. Therefore, in the way that the LDA model is 
typically used, it does not provide a complete generative process of a document as there is no rule to 
sample L|A. In this paper, this fact is important, as we need to model the document length in order 
to make the link with discrete ICA. 


Discrete PCA. The LDA model (2) can be seen as a discretization of principal component anal¬ 
ysis (PCA) via replacement of the normal likelihood with the multinomial one and adjusting the 
prior [9] in the following probabilistic PCA model [10, 11]: 9 ^ Normal(0, lx) and x\9 ^ 
Normal(i90, a^lM), where D G is a transformation matrix and tr is a parameter. 

Discrete ICA (DICA). Interestingly, a small extension of the LDA model allows its interpreta¬ 
tion as a discrete independent component analysis model. The extension naturally arises when the 
document length for the LDA model is modeled as a random variable from the gamma-Poisson 
mixture (which is equivalent to a negative binomial random variable), i.e., L|A ^ Poisson(A) and 
A ^ Gamma(co, b), where cq := is the shape parameter and & > 0 is the rate parameter. The 

LDA model (2) with such document length is equivalent (see Appendix B. 1) to 


Qffc Gamma(cfc,&), 
Xm\a Poisson([i9a]m), 


GP model (3) 


where all ai, 0 : 2 ,..., ax are mutually independent, the parameters Ck coincide with the ones of the 
LDA model in (2), and the free parameter b can be seen (see Appendix B.2) as a scaling parameter 
for the document length when cq is already prescribed. 

This model was introduced by Canny [12] and later named as a discrete ICA model [13]. It is more 
natural, however, to name model (3) as the gamma-Poisson (GP) model and the model 


Ok ~ mutually independent, 

Xm\oi ^ Poisson([£)a]m) 


DICA model (4) 


as the discrete ICA (DICA) model. The only difference between (4) and the standard ICA model [6, 
7, 8] (without additive noise) is the presence of the Poisson noise which enforces discrete, instead of 
continuous, values of Xm- Note also that (a) the discrete ICA model is a semi-parametric model that 
can adapt to any distribution on the topic intensities ak and that (b) the GP model (3) is a particular 
case of both the LDA model (2) and the DICA model (4). 

Thanks to this close connection between LDA and ICA, we can reuse standard ICA techniques to 
derive new efficient algorithms for topic modeling. 


3 Moment matching for topic modeling 

The method of moments estimates latent parameters of a probabilistic model by matching theoretical 
expressions of its moments with their sample estimates. Recently [3, 4], the method of moments 
was applied to different latent variable models including LDA, resulting in computationally fast 
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learning algorithms with theoretical guarantees. For LDA, they (a) construct LDA moments with a 
particular diagonal structure and (b) develop algorithms for estimating the parameters of the model 
by exploiting this diagonal structure. In this paper, we introduce the novel GP/DICA cumulants with 
a similar to the LDA moments structure. This structure allows to reapply the algorithms of [3, 4] 
for the estimation of the model parameters, with the same theoretical guarantees. We also consider 
another algorithm applicable to both the LDA moments and the GP/DICA cumulants. 

3.1 Cumulants of the GP and DICA models 

In this section, we derive and analyze the novel cumulants of the DICA model. As the GP model is 
a particular case of the DICA model, all results of this section extend to the GP model. 

The first three cumulant tensors for the random vector x can be defined as follows 

cum(a:) := ]E(a:), (5) 

cum(x, x) := coy{x, x) = E [(a: — E(a:))(x — E(a:))^] , (6) 

cum(x, X, x) := E [(a; — E(a;)) (Si {x — E(a;)) (S {x — E(a;))] , (7) 

where 0 denotes the tensor product (see some properties of cumulants in Appendix C.l). The 
essential property of the cumulants (which does not hold for the moments) that we use in this paper 
is that the cumulant tensor for a random vector with independent components is diagonal. 

Let y = Da', then for the Poisson random variable XmlUm ^ Poisson(?/m), the expectation is 
'^{xm\ym) = Vm- Hcncc, by the law of total expectation and the linearity of expectation, the 
expectation in (5) has the following form 

E(a;) = E(E(a;|t/)) = E(?/) = DE(a). (8) 

Further, the variance of the Poisson random variable Xm is var(a:m|ym) = Vm and, as Xi, 
X 2 , ..., XM are conditionally independent given y, then their covariance matrix is diagonal, i.e., 
cov(a;, x\y) = diag(y). Therefore, by the law of total covariance, the covariance in (6) has the form 

cov(a;, cc) = E [cov(a;, a;|y)] + cov [E(a;|y), E(a::|y)] 

= diag [E(y)] + cov(y, y) = diag [E(x)] + Dcov{a, a)D^, 

where the last equality follows by the multilinearity property of cumulants (see Appendix C.l). 
Moving the first term from the RHS of (9) to the LHS, we define 

S := cov(x, x) — diag [E(x)] . DICA S-cum. (10) 

From (9) and by the independence of ai, ..., a/c (see Appendix C.3), S' has the following diagonal 
structure 

S = var(ak)dkd/[ = Ddiag [var(a)] D^. (11) 

By analogy with the second order case, using the law of total cumulance, the multilinearity property 
of cumulants, and the independence of ai, ..., a^, we derive in Appendix C.2 the expression (24), 
similar to (9), for the third cumulant (7). Moving the terms in this expression, we define a tensor T 
with the following element 

[^]mim 2 m 3 ■= cum(xmi,a:m2,a:m3) + 2<)(mi,TO2,m3)E(xmi) DICAT-cum. (12) 

- S(m 2 ,m 3 )cov(xm^,XmJ - S(mi,m 3 ')cov(xm^,XmJ - S(mi, m 2 )cov(xmi, 

where S is the Kronecker delta. By analogy with (11) (Appendix C.3), the diagonal structure of the 
tensor T; 

T = c\xvcL{ak,ak,ak)dk® dk® dk- (13) 

In Appendix E. 1, we recall (in our notation) the matrix S (39) and the tensor T (40) for the LDA 
model [3], which are analogues of the matrix S (10) and the tensor T (12) for the GP/DICA mod¬ 
els. Slightly abusing terminology, we refer to the matrix S (39) and the tensor T (40) as the LDA 
moments and to the matrix S (10) and the tensor T (12) as the GP/DICA cumulants. The diagonal 
structure (41) & (42) of the LDA moments is similar to the diagonal structure (11) & (13) of the 
GP/DICA cumulants, though arising through a slightly different argument, as discussed at the end of 
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Appendix E. 1 . Importantly, due to this similarity, the algorithmic frameworks for both the GP/DICA 
cumulants and the EDA moments coincide. 

The following sample complexity results apply to the sample estimates of the GP cumulants: ' 
Proposition 3.1. Under the GP model, the expected error for the sample estimator S {29) for the 
GP cumulant S (10) is: 

O ( max 
\^/N 

where A := max^ HrffclU. co ■= min(l, cq) and L := E(L). 

A high probability bound could be derived using concentration inequalities for Poisson random 
variables [14]; but the expectation already gives the right order of magnitude for the error (for 
example via Markov’s inequality). The expression (29) for an unbiased finite sample estimate S of S 
and the expression (30) for an unbiased finite sample estimate T of T are defined^ in Appendix C.4. 
A sketch of a proof for Proposition 3.1 can be found in Appendix D. 

By following a similar analysis as in [15], we can rephrase the topic recovery error in term of the 
etTor on the GP cumulant. Irnportantly, the whitening transformation (introduced in Section 4) redi¬ 
vides the error on S (14) by which is the scale of S (see Appendix D.5 for details). This means 
that the contribiUion from S to the recovery error will scale as 0(1/V^max{A, cq/L}), where 
both A and cq jL are smaller than 1 and can be very small. We do not present the exact expression 
for the expected squared etTor for the estimator of T, but due to a similar structure in the derivation, 
we expect the analogous bound of E[||T — T||ir] < 1/sfN max{A^/^Z^, 

Current sample complexity results of the LDA moments [3] can be summarized as 0(1/V^). How¬ 
ever, the proof (which can be found in the supplementary material [15]) analyzes only the case when 
finite sample estimates of the LDA moments are constructed from one triple per document, i.e., 
wi®W 2 ® W 3 only, and not from the U-statistics that average multiple (dependent) triples per doc¬ 
ument as in the practical expressions (43) and (44) (Appendix F.4). Moreover, one has to be careful 
when comparing upper bounds. Nevertheless, comparing the bound ( 14) with the current theoretical 
results for the LDA moments, we see that the GP/DICA cumulants sample complexity contains the 
f 2 -norm of the columns of the topic matrix D in the numerator, as opposed to the 0(1) coefficient 
for the LDA moments. This norm can be significantly smaller than 1 for vectors in the simplex 
(e.g., A = O(l/||(ife||o) for sparse topics). This suggests that the GP/DICA cumulants may have 
better finite sample convergence properties than the LDA moments and our experimental results in 
Section 5.2 are indeed consistent with this statement. 

The GP/DICA cumulants have a somewhat more intuitive derivation than the LDA moments as 
they are expressed via the count vectors x (which are the sufficient statistics for the model) and 
not the tokens wfs. Note also that the construction of the LDA moments depend on the unknown 
parameter cq. Given that we are in an unsupervised setting and that moreover the evaluation of 
LDA is a difficult task [16], setting this parameter is non-trivial. In Appendix G.4, we observe 
experimentally that the LDA moments are somewhat sensitive to the choice of cq. 

4 Diagonalization algorithms 

How is the diagonal structure (11) of Z and (13) of T going to be helpful for the estimation of the 
model parameters? This question has already been thoroughly investigated in the signal processing 
(see, e.g., [17, 18, 19, 20, 21, 5] and references therein) and machine learning (see [3, 4] and refer¬ 
ences therein) literature. We review the approach in this section. Due to similar diagonal structure, 
the algorithms of this section apply to both the LDA moments and the GP/DICA cumulants. 

For simplicity, let us rewrite the expressions (11) and (13) for S and T as follows 

S = '^^SkdkdJ, r = OdfeOdfc, (15) 

'Note that the expected squared error for the DICA cumulants is similar, hut the expressions are less compact 
and, in general, depend on the prior on Ofc. 

^For completeness, we also present the finite sample estimates S (43) and T (44) of S (39) and T (40) for 
the LDA moments (which are consistent with the ones suggested in [4]) in Appendix F.4. 
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where Sk ■= var(afc) and tk ■= cvim{ak,cxk,oik)- Introducing the rescaled topics dk '■= y/^dk, 
we can also rewrite S = DD^. Following the same assumption from [3] that the topic vectors are 
linearly independent {D is full rank), we can compute a whitening matrix W G of S, i.e., 

a matrix such that WSW^ = Ik where Ik is the K-by-K identity matrix (see Appendix F. 1 for 
more details). As a result, the vectors Zk ■= Wdk form an orthonormal set of vectors. 

Further, let us define a projection T(v) € of a tensor T G '^kxKxk ^ vector u G 

T{u)kik2 ■='^ , Tk^k^k^uk^. (16) 

Applying the multilinear transformation (see, e.g., [4] for the definition) with to the tensor T 
from (15) and projecting the resulting tensor T := T(kF^, W^, W^} onto some vector u G R^, 
we obtain 

T(u)=y^ tk{zk,u)zkzj, (17) 

where tk := ^k! ^k"^ is due to the rescaling of topics and (•, •) stands for the inner product. As the 
vectors Zk are orthonormal, the pairs Zk and \k := tk{zk,u) are the eigenpairs of the matrix T{u), 
which are uniquely defined if the eigenvalues Afe are all different. If they are unique, we can recover 
the GP/DICA (as well as LDA) model parameters via dk = Zk and tk = Xk/{zk, u)- 

This procedure was referred to as the spectral algorithm for LDA [3] and the fourth-order^ blind 
identification algorithm for ICA [17, 18]. Indeed, one can expect that the finite sample estimates 
S (29) and T (30) possess approximately the diagonal structure (11) and (13) and, therefore, the rea¬ 
soning from above can be applied, assuming that the effect of the sampling error is controlled. 

This spectral algorithm, however, is known to be quite unstable in practice (see, e.g., [22]). To over¬ 
come this problem, other algorithms were proposed. For ICA, the most notable ones are probably 
the FastICA algorithm [20] and the JADE algorithm [21]. The FastICA algorithm, with appropriate 
choice of a contrast function, estimates iteratively the topics, making use of the orthonormal struc¬ 
ture (17), and performs the deflation procedure at every step. The recently introduced tensor power 
method (TPM) for the LDA model [4] is close to the FastICA algorithm. Alternatively, the JADE al¬ 
gorithm modifies the spectral algorithm by performing multiple projections for (17) and then jointly 
diagonalizing the resulting matrices with an orthogonal matrix. The spectral algorithm is a special 
case of this orthogonal joint diagonalization algorithm when only one projection is chosen. Impor¬ 
tantly, a fast implementation [23] of the orthogonal joint diagonalization algorithm from [24] was 
proposed, which is based on closed-form iterative Jacobi updates (see, e.g., [25] for the later). 

In practice, the orthogonal joint diagonalization (JD) algorithm is more robust than EastICA (see, 
e.g., [26, p. 30]) or the spectral algorithm. Moreover, although the application of the JD algorithm 
for the learning of topic models was mentioned in the literature [4, 27], it was never implemented in 
practice. In this paper, we apply the JD algorithm for the diagonalization of the GP/DICA cumulants 
as well as the LDA moments, which is described in Algorithm 1 . Note that the choice of a projection 
vector Vp G R^ obtained as Vp = W^Up for some vector Up G R^ is important and corresponds to 
the multilinear transformation of T with along the third mode. Importantly, in Algorithm 1 , the 
joint diagonalization routine is performed over (P-f 1) matrices of size KxK, where the number of 
topics K is usually not too big. This makes the algorithm computationally fast (see Appendix G.l). 
The same is true for the spectral algorithm, but not for TPM. 

In Section 5.1, we compare experimentally the performance of the spectral, JD, and TPM algorithms 
for the estimation of the parameters of the GP/DICA as well as LDA models. We are not aware of 
any experimental comparison of these algorithms in the LDA context. While already working on 
this manuscript, the JD algorithm was also independently analyzed by [27] in the context of tensor 
factorization for general latent variable models. However, [27] focused mostly on the comparison 
of approaches for tensor factorization and their stability properties, with brief experiments using a 
latent variable model related but not equivalent to LDA for community detection. In contrast, we 
provide a detailed experimental comparison in the context of LDA in this paper, as well as propose 
a novel cumulant-based estimator. Due to the space restriction the estimation of the topic matrix D 
and the (gamma/Dirichlet) parameter c are moved to Appendix P.6. 

^See Appendix C.5 for a discussion on the orders. 
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Algorithm 1 Joint diagonalization (JD) algorithm for GP/DICA cumulants (or LDA moments) 

1: Input: X G , K, P (number of random projections); (and cq for LDA moments) 

2: Compute sample estimate S G ((29) for GP/DICA / (43) for LDA in Appendix F) 

3: Estimate whitening matrix W G of S (see Appendix F.l) 

option (a): Choose vectors {ui,U2, ■ ■ ■ ,up} C uniformly at random from the unit £2- 
sphere and set Vp = W^Up G for all p = 1,..., P (P = 1 yields the spectral algorithm) 
option (b): Choose vectors {ui,U2,. ■. ,up} C as the canonical basis €1,62, ■ ■ ■ ,eK of 
and set Vp = W^Up G R'^ for all p = 1,..., AT 

4: For Vp, compute Bp = WT{vp)W^ G R^^^ ((52) for GP/DICA / (54) for LDA; Appendix F) 
5: Perform orthogonal joint diagonalization of matrices {PL5'IL^ = Ik, Bp, p = 

(see [24] and [23]) to find an orthogonal matrix V G R^^^ and vectors {oi, 02 ,..., op} C R^ 
such that __ 

VWSW^V^ = Ik, and VBpV^ « diag(ap), p = 1,... ,P 

6: Estimate joint diagonalization matrix A = VW and values ap,p= 1,... ,P 
7: Output: Estimate of D and c as described in Appendix F.6 


5 Experiments 

In this section, (a) we compare experimentally the GP/DICA cumulants with the LDA moments and 
(b) the spectral algorithm [3], the tensor power method [4] (TPM), the joint diagonalization (JD) 
algorithm from Algorithm 1 , and variational inference for LDA [ 1 ]. 

Real data: the associated press (AP) dataset, from D. Blei’s web page,”^ with N = 2, 243 documents 
and M = 10,473 vocabulary words and the average document length L = 194; the NIPS papers 
dataset^ [28] of 2,483 NIPS papers and 14, 036 words, and L = 1, 321; the KOS dataset,^ from the 
UCI Repository, with 3,430 documents and 6,906 words, and L = 136. 

Semi-synthetic data are constructed by analogy with [29]: (1) the LDA parameters D and c are 
learned from the real datasets with variational inference and ( 2 ) toy data are sampled from a model 
of interest with the given parameters D and c. This provides the ground truth parameters D and c. 
For each setting, data are sampled 5 times and the results are averaged. We plot error bars that are 
the minimum and maximum values. For the AP data, K G {10, 50} topics are learned and, for 
the NIPS data, K G {10,90} topics are learned. For larger AT, the obtained topic matrix is ill- 
conditioned, which violates the identifiability condition for topic recovery using moment matching 
techniques [3]. All the documents with less than 3 tokens are resampled. 

Sampling techniques. All the sampling models have the parameter c which is set to c = cqc/ ||c|| 
where c is the learned c from the real dataset with variational LDA, and cq is a parameter that we 
can vary. The GF data are sampled from the gamma-Poisson model (3) with b = cq/A so that 
the expected document length is L (see Appendix B.2). The LDA-fix(L) data are sampled from the 
LDA model (2) with the document length being fixed to a given L. The LDA-fix2('y,Li,L2) data 
are sampled as follows: (1 — 7 )-portion of the documents are sampled from the LDA-fix(Li) model 
with a given document length Li and 7 -portion of the documents are sampled from the LDA-fix(L 2 ) 
model with a given document length L 2 . 

Evaluation. The evaluation of topic recovery for semi-synthetic data is performed with the Pi¬ 
errot between the recovered D and true D topic matrices with the best permutation of columns: 
eiT^j^ (D, D) := minjrgpERM 5 ^ X) fc ~ II 1 £ [0: !]■ The minimization is over the possible 

permutations tt G PERM of the columns of D and can be efficiently obtained with the Hungarian 
algorithm for bipartite matching. For the evaluation of topic recovery in the real data case, we use 
an approximation of the log-likelihood for held out documents as the metric [16]. See Appendix G .6 
for more details. 

"'http : / /WWW . cs . Columbia . edu/~blei/1 da-c 
^ http://ai.stanford.edu/~gal/data 

® https://archive.ics.uci.edu/ml/datasets/Bag+of+Words 
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Figure 1: Comparison of the diagonalization algorithms. The topic matrix D and Dirichlet parameter c are 
learned for iT = 50 from AP; c is scaled to sum up to 0.5 and b is set to fit the expected document length 
L = 200. The semi-synthetic dataset is sampled from GP; number of documents N varies from 1,000 to 
50, 000. Left: GP/DICA moments. Right: LDA moments. Note: a smaller value of the -error is better. 


We use our Matlab implementation of the GP/DICA cumulants, the LDA moments, and the diag¬ 
onalization algorithms. The datasets and the code for reproducing our experiments are available 
online.^ In Appendix G.l, we discuss the complexity and implementation of the algorithms. We 
explain how we initialize the parameter cq for the LDA moments in Appendix G.3. 


5.1 Comparison of the diagonalization algorithms 

In Figure 1 , we compare the diagonalization algorithms on the semi-synthetic AP dataset for K = 50 
using the GP sampling. We compare the tensor power method (TPM) [4], the spectral algorithm 
(Spec), the orthogonal joint diagonalization algorithm (JD) described in Algorithm 1 with different 
options to choose the random projections: JD(k) takes P — K vectors Up sampled uniformly from 
the unit ^ 2 -sphere in K.^ and selects Vp = Up (option (a) in Algorithm 1); JD selects the full basis 
Cl,..., Cif in and sets Vp = Cp (as JADE [21]) (option (b) in Algorithm 1); JD{f) chooses 
the full canonical basis of as the projection vectors (computationally expensive). 

Both the GP/DICA cumulants and LDA moments are well-specified in this setup. However, the 
LDA moments have a slower finite sample convergence and, hence, a larger estimation error for the 
same value N. As expected, the spectral algorithm is always slightly inferior to the joint diagonal¬ 
ization algorithms. With the GP/DICA cumulants, where the estimation error is low, all algorithms 
demonstrate good performance, which also fulfills our expectations. However, although TPM shows 
almost perfect performance in the case of the GP/DICA cumulants (left), it significantly deteriorates 
for the LDA moments (right), which can be explained by the larger estimation error of the LDA 
moments and lack of robustness of TPM. The running times are discussed in Appendix G.2. Over¬ 
all, the orthogonal joint diagonalization algorithm with initialization of random projections as 
multiplied with the canonical basis in (JD) is both computationally efficient and fast. 

5.2 Comparison of the GP/DICA cumulants and the LDA moments 

In Figure 2, when sampling from the GP model (top, left), both the GP/DICA cumulants and LDA 
moments are well specified, which implies that the approximation error (i.e., the error w.r.t. the 
model (mis)fit) is low for both. The GP/DICA cumulants achieve low values of the estimation error 
already for N = 10,000 documents independently of the number of topics, while the convergence 
is slower for the LDA moments. When sampling from the LDA-fix(200) model (top, right), the 
GP/DICA cumulants are mis-specified and their approximation error is high, although the estimation 
error is low due to the faster finite sample convergence. One reason of poor performance of the 
GP/DICA cumulants, in this case, is the absence of variance in the document length. Indeed, if 
documents with two different lengths are mixed by sampling from the LDA-fix2(0.5,20,200) model 
(bottom, left), the GP/DICA cumulants performance improves. Moreover, the experiment with a 
changing fraction 7 of documents (bottom, right) shows that a non-zero variance on the length 
improves the performance of the GP/DICA cumulants. As in practice real corpora usually have a 
non-zero variance for the document length, this bad scenario for the GP/DICA cumulants is not 
likely to happen. 


https : //github . com/anastasia-podosinnikova/dica 
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Figure 2: Comparison of the GP/DICA cumulants and LDA moments. Two topic matrices and parameters ci 
and C 2 are learned from the NIPS dataset for K = 10 and 90; ci and C 2 are scaled to sum up to cq = 1. 
Four corpora of different sizes N from 1,000 to 50,000: top, left: b is set to fit the expected document length 
L = 1300; sampling from the GP model; top, right: sampling from the LDA-fix(200) model; bottom, left: 
sampling from the LDA-fix2(0.5,20,200) model. Bottom, right: the number of documents here is fixed to 
N = 20, 000; sampling from the LDA-fix2('y,20,200) model varying the values of the fraction 7 from 0 to 1 
with the step 0.1. Note: a smaller value of the fi-error is better. 



-•-JD-GP 
-♦-JD-LDA 
Spec-GP 
■■■*■■■ Spec-LDA 
-*-VI 
-♦-VI-JD 



Figure 3: Experiments with real data. Left: the AP dataset. Right: the KOS dataset. Note: a higher value of 
the log-likelihood is better. 


5.3 Real data experiments 


In Figure 3, JD-GP, Spec-GP, JD-LDA, and Spec-LDA are compared with variational inference (VI) 
and with variational inference initialized with the output of JD-GP (VI-JD). We measure the held 
out log-likelihood per token (see Appendix G.7 for details on the experimental setup). The orthogo¬ 
nal joint diagonalization algorithm with the GP/DICA cumulants (JD-GP) demonstrates promising 
performance. In particular, the GP/DICA cumulants significantly outperform the LDA moments. 
Moreover, although variational inference performs better than the JD-GP algorithm, restarting varia¬ 
tional inference with the output of the JD-GP algorithm systematically leads to better results. Similar 
behavior has already been observed (see, e.g., [30]). 


6 Conclusion 


In this paper, we have proposed a new set of tensors for a discrete ICA model related to LDA, where 
word counts are directly modelled. These moments make fewer assumptions regarding distributions, 
and are theoretically and empirically more robust than previously proposed tensors for LDA, both 
on synthetic and real data. Following the ICA literature, we showed that our joint diagonalization 
procedure is also more robust. Once the topic matrix has been estimated in a semi-parametric way 
where topic intensities are left unspecified, it would be interesting to learn the unknown distributions 
of the independent topic intensities. 

Aknowledgements. This work was partially supported by the MSR-Inria Joint Center. The authors 
would like to thank Christophe Dupuy for helpful discussions. 
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A Appendix. Plate diagrams for the models from Section 2 


c c 
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Figure 4: Plate diagrams for the models from Section 2. 


In Section 2, the index n, which stands for the n-th document, was omitted. For convenience, we 
recall the models. The LDA model in the tokens representation: 


9n ^ Dirichlet(c), 

Zni\0n ^ Multinomial(l, 6»„), (18) 

Wne\z„e,9n ^ Multinomial(l, 


the LDA model with the marginalized out latent variable z: 


the GP model: 


and the DICA model: 


9n ^ Dirichlet(c), 

Xn\dn Multinomial(L„, 

a„fe Gamma(cfe, b), 

^ PoisSOn([L10iyi]yj.j), 

a„i,..., anK ^ mutually independent, 
Xnmlctn ~ PoisSOn( . 


(19) 


( 20 ) 

( 21 ) 


B Appendix. The GP model 

B.l The connection between the LDA and GP models 

To show that the LDA model (2) with the additional assumption that the document length is modeled 
as a gamma-Poisson random variable is equivalent to the GP model (3), we show that: 

- when modeling the document length L as a Poisson random variable with a parameter A, 
the count vectors xi, X 2 , ■ ■ ■, xm are mutually independent Poisson random variables; 

- the Gamma prior on A reveals the connection at = A0fe between the Dirichlet random 
variable 9 and the mutually independent gamma random variables ai, •. ■, a^. 

For completeness, we repeat the known result that if L ^ Poisson(A) and x\L ^ 
Multinomial(L, iA0) (which thus means that L = with probability one), then xi, X 2 , 

..., Xm are mutually independent Poisson random variables with parameters A \D9]^, A [D9]2, ..., 
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A \D9]j^. Indeed, we consider the following joint probability mass function where x and L are 
assumed to be non-negative integers: 


p{x,L\e,\) =p{L\\)p{x\L,e) 

exp (-A) 


=1 






nw 


exp(-A^[i?0]„)A^---n 


ml 


=1 




n 


eM-Hmjmmm 


nPoisson(xm;A[I?6»]^), 

m 

where in the third equation we used the fact that 


[D9]^ = DmkOk = y]] Ok Dmk = 1- 

m m,k k m 


We thus have p{x, L\6, A) = p{L\x) Y\^p{xm\^,D9]m) where p{L\x) is simply the deterministic 
distribution andp(a:m|A[lj0]m) form = 1,... ,M are independent Poisson( A [iA0]m) 

distributions (and thus do not depend on L). Note that in the notation introduced in the paper, 
Dmk = dkm- Hence, by using the construction of the Dirichlet distribution from the normalization 
of independent gamma random variables, we can show that the LDA model with a gamma-Poisson 
prior over the length is equivalent to the following model (recall, that cq = Cfc): 

A ^ Gamma(co, h), 

9 ^ Dirichlet (c), (22) 

Xm|A, 0 - Poisson([iA(A0)]^). 


More specifically, we complete the second part of the argument with the following properties. When 
ai, 02 , ..., ax are mutually independent gamma random variables, each ^ Gamma(c/c, b), 
their sum is also a gamma random variable ^ Gamma(^j, Ck, h). The former is equivalent 

to A. It is known (e.g., [32]) that a Dirichlet random variable can be sampled by first sampling 
independent gamma random variables (o^) and then dividing each of them by their sum (A): 9k = 
Ofc / 'Yhk' , and, in other direction, the variables Ofe = X9k are mutually independent, giving back 
the GP model (3). 


B.2 The expectation and the variance of the document length for the GP model 


From the drivations in Appendix B.l, it follows that the document length of the GP model (3) is a 
gamma-Poisson random variable, i.e., L\X ^ Poisson(A) and A ~ Gamma(co,&). Therefore, the 
following follows from the law of total expectation and the law of total variance 


E(L) = E [E(L|A)] = E(A) = co/b 

var(L) = var [E(L|A)] -f E [var(L|A)] = var(A) -t- E(A) = co/b + co/6^ 


The first expression shows that the parameter b controls the expected document length E(L) for a 
given parameter cq: the smaller b, the larger E(L). On the other hand, if we allow cq to vary as 
well, only the ratio co/b is important for the document length. We can then interpret the role of cq 
as actually controlling the concentration of the distribution for the length L (through the variance). 
More specifically, we have that: 


var(L) 1 ^ 1 

(E(L))2 " ^ + 


(23) 


For a fixed target document length E(L), we can increase the variance (and thus decrease the con¬ 
centration) by using a smaller Cq. 
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C Appendix. The cumulants of the GP and DICA models 
C.l Cumulants 

For a random vector x G the first three cumulant tensors^ are 
cum(a;m) = 

CUin(x 7 m 7 ~ [(^mi )) (2^7712 ^(^7712)] “ COv(Xm,i 5 ^m 2 ) j 

Crim(x777,7 , X777,2 7 [(^mi ]E(^mi))(^m2 ]^(^m2 )) (^ma ^(^Jria)))]- 

Note that the 2nd and 3rd cumulants coincide with the 2nd and 3rd central moments (but not for 
higher orders). In the following, cum(x, x, x) G (jgnotes the third order tensor with ele¬ 

ments cum(xmi, Xra 2 1 ^Cma )■ Some of the properties of cumulants are listed below (see [5, chap. 5]). 
The most important property that motivate us to use cumulants in this paper (and the ICA literature) 
is the independence property, which says that the cumulant tensor for a random vector with inde¬ 
pendent components is diagonal (this property does not hold for the (non-central) moment tensors 
of any order, and neither for the central moments of order 4 or more). 

- Independence. If the elements of x G are independent, then their cross-cumulants 
are zero as soon as two indices are different, i.e., cum(xmi,a:m 2 ) = ^(wi,m 2 )E[(x„jj — 

and cum(xm 7 ,Xm 2 ,a;m 3 ) = 5{mi,m2,rn3)E[{xrm - E(x„ 7 j)^], where 5 is the 
Kronecker delta. 

- MultiUnearity. If two random vectors y G and a G are linearly dependent, i.e., 

y = Da for some D G then 

cum(y^) = y^cmn{ak)Drnk, 
k 

CUm(yj7jj^, y77i2 ) — ^ ^ 0.k2^Dmiki^m2k2i 

ki ,^2 

Clliri(y77j^ , : ym3 ) — ^ ^ CUiri(Q(fcj^, 01^21 Q^fc3)-^mifci-^m2fc2-^?Tl3fc3; 

ki,k2,k3 


which can also be denoted"^ by 

E(t/) = DE{a), 
cov{y,y) = Dcov{a,a)D^, 
cnm{y,y,y) = cum{a,a,a){D^, , D^). 


- The law of total cumulance. For two random vectors x G and y G K.*^, it holds 
cum(xm) = E [E{xm\y)], 

cvLm{xmi,Xm 2 ) = E [cov(x„7,,x^^|t/)] -|-cov[E(x„7j?/),E(x,„2|t/)], 
cum(xmi, Xm2 , Xma) = E [cum(x,„7 , Xma , Xma Iv)] + CUm ^{Xma\y),E{Xm2\y),^iXma Iv)] 
-f cov[E(x 7 „Jy),cov( 

^m2 J ^m3 |y)] 

-|-cov[E(xm2l?/),cov( 

^mi J ^7713 \y)] 

+ COV [E(x,„3 \y), COY{Xmi, Xm^\y)] . 

Note that the first expression is also well known as the law of total expectation or the tower 
property, while the second one is known as the law of total covariance. 


^Strictly speaking, the (scalar) n-th cumulant of a random variable X is defined via the cumulant- 
generating function g{t), which is the natural logarithm of the moment-generating function, i.e g{t) := 
logE [e*^]. The cumulant is then obtained from a power series expansion of the cumulant-generating 
function, that is g(t) = /n\ [Wikipedia]. 

®In [4], given a tensor T G T{D^, , D^) is referred to as the multilinear map. In [34], the 

same entity is denoted by T Xi X 2 X 3 , where x„ denotes the n-mode tensor-matrix product. 
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C.2 The third cumulant of the GP/DICA models 


In this section, by analogy with Section 3.1, we derive the third GP/DICA cumulant. 

As the third cumulant of a Poisson random variable Xm with parameter is E((a;m — 
^{xm))^\ym) — Um, then by the independence property of cumulants from Section C.l, the cu¬ 
mulant of x|j/ is diagonal: 

cn'm{xrm,Xm2,Xm3\y) = S{mi,m2,m3)yrm- 

Substituting the cumulant of x\y into the law of total cumulance, we obtain 

CUm(Xm^ , X'iyL2 7 Xms ) — E [cum(x772^ , X^ri^ 7 Xjn^ |y)] 

+ cum [E(xmi \y),E{xm 2 1 //). ]E(a:m 3 1//)] + cov [E(xm, \y), cov(x 7712 J ^7713 |y)] 

-f cov [E(Xm 2 \y)7 COv(Xmi , Xm3 I?/)] + COV [E(Xm 3 I//); COv(Xmi, Xm2 I//)] 

= (5(mi,m2,TO3)E(j/mJ -f cum(j/m^, 2/m3) 

-I- (5(m2,TO3)cov(ymi,?/m2) + ^(wi,TO3)cov(ymi,2/m2) + <5(wi,m2)cov(ymj,?/m3) 

= 5(mi,m2,TO3)E(XmJ -I- cum{ymi7ym2 7ym3) 

+ S(m2,m3)cov(xmi,Xm2) - S{mi,m2 7m3)E{xrm) 

+ 6{mi,m3)cov{Xmi,Xm2) - l^("li,TO2,m3)E(XmJ 

-I- ( 5 (mi,TO 2 )cOv(Xmi, 2 : 7713 ) - l 5 (mi,TO 2 ,? 2 l 3 )E(XmJ 

= cum(y 77 i 7 ,?/ 7 „ 2 , 2 / 7773 ) - 2 < 5 (TOi,m 2 ,TO 3 )E(x, 77 i) 

-f (5(m2,TO3)cOv(x777i,X7772 ) + :5 (toi,TO3)cOv(x 7„7,X7773) -f i5(toi,TO2)cOv(x7„7,X7713) 

= [cum(a,a,Q!)(D^,i:)^,i:)^)]^^^^^^ - 2,5 (toi, m 2 , TO 3 )E(x,„ 7 ) 

-I- (5(m2,m3)cov(x77ii,X7772 ) + S{mi,m3)cov{Xmi7 Xm2) + S{mi,m2)cov{Xmi7Xm3)7 (24) 
where, in the third equality, we used the previous result from (9) that cov{y, y) = cov(x, x) — 
diag(E(x)). 

C.3 The diagonal structure of the GP/DICA cumulants 

In this section, we provide detailed derivation of the diagonal structure (11) of the matrix S' (10) and 
the diagonal structure ( 13) of the tensor T (12). 

From the independence of ai, a 2 , ■ ■ ■, Q!^ and by the independence property of cumulants 
from Section C.l, it follows that cov{a,a) is a diagonal matrix and cum(a,a,a) is a 
diagonal tensor, i.e., cov(afc 7 , afej = S{ki, k 2 )cov{ak 3 , ak 2 ) and cum(afe 7 , afe 3 ) = 
S{ki , k 2 , /c 3 )cum(afe 7 , , ciki)- Therefore, the following holds 

C0v(2/7777 , 2/7712 ) — ^ ' cO'v(^CXk7 ^k')k^mikk^m2k7 
k 

CUm(2/77ii , 2 /m 2 7 2/7773 ) — ^ ^ CUm(G:fc 7 o^k 7 ^k^f^mikf^m2kf^m3k 7 
k 

which we can rewrite in a matrix/tensor form as 

cov{y,y) = y^ cov{ak, ak)dkdj, 
k 

cum( 2 /, 2 /, 2 /) = y^cum(Q;fc,Qfc,afc)dfc ®dk® dk- 

k 

Moving cov{y, y) / cum( 2 /, y, y) in the expression for cov(x, x) (9) / cum(x, x, x) (24) on one side of 
equality and all other terms on the other side, we define matrix S £ /tensor T £ xMxm 

as follows 

S := cov(x, x) — diag (E(x)), (25) 

2^777 1 777 2 7773 ■ CUm(x 77 ii , X 7712 7 X 7713 ) “b ‘2S(^TTll , 7Jl2 7 m 3 )E(x 77 ii ) 

- d(m2,m3)cov(x777i,a;7772 ) 

- d(mi,m3)cov(x777i,a:7772 ) 

- d(mi,m2)cov(x777i,a:7773 ). 
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( 26 ) 


( 27 ) 


By construction, S = cov{y, y) and T = cum(?/, y, y) and, therefore, it holds that 

S = y^cov{ak,ak)dkdJ, 

k 

T ='^c\iu\{akjak,ak)dk® dk® dk- (28) 

k 

This means that both the matrix S and the tensor T are sums of rank-1 matrices and tensors, respec- 
tively'®. This structure of the matrix S and the tensor T is the basis for the algorithms considered in 
this paper. 


C.4 Unbiased finite sample estimators for the GP/DICA cumulants 


Given a sample {xi,X 2 , ■ ■ ■, xat}, we obtain a finite sample estimate S' of S' (10) / T of T (12) for 
the GP/DICA cumulants: 


S := cov(x, x) — diag 


T 

rr 


i 2(5(??7-i, 777-2, ^3)^(^mi ) 

- (5(7772, 77Z3)cOv(x 

mi 1 ^7712 ) 

- 6{mi,m3)^{Xmi,Xm2) 

- 6{mi,m2)^iXmi,Xm3), 

where unbiased estimators of the first three cumulants are 


E(a:„ 




COv{Xmi,Xm 2 ) = 


N-1 


E' 


"nmi ^nm 2 t 


CXUni^Xm^ Xm2 t ^m^^ —■ 


N 




-nmi ^727712 '^nms 5 


(29) 


(30) 


(31) 


{N-l){N-2) 

where the word vocabulary indexes are mi,m 2 , m 3 = 1,2,... ,M and the centered documents 
Znm '■= Xnm — (The latter is introduced only for compact representation of (31) and is 

different from z in the LDA model.) 


C.5 On the orders of cumulants 

Note that the factorization of S = DD^ does not uniquely determine D as one can equivalently use 
S = {DU){DU)^ with any orthogonal K x K matrix U. Therefore, one has to consider higher 
than the second order information. Moreover, in ICA the fourth-order tensors are used, because the 
third cumulant of the Gaussian distribution is zero, which is not the case in the DICA/LDA models, 
where the third order information is sufficient. 


D Appendix. The sketch of the proof for Proposition 3.1 
D.l Expected squared error for the sample expectation 

The sample expectation is E(x) = ^ Xn is an unbiased estimator of the expectation and: 


e(||E(x)-E(x)||2) =^E (e(x„)-E(x„))' 


= —E 

7V2 


E I E - E(x„i))^ ) + ® f E E “ E(Xm)) {Xr, 

\ n / \ n ra/ra' 


— ^r E! ^ E(a;m)) — ^^var(xm)- 


*°For tensors, such decomposition is also known under the names CANDECOMP/PARAFAC or, simply, the 
CP decomposition (see, e.g., [34]). 


- E{Xjn)) 
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Further, by the law of total variance; 


E(||E(a;) -E(x)||^) = ^ ^ [E(var(a;„^|y)) + var(E(a;„,|y))] = ^ ^ [E(y^) + var( 2 /^)] 


N 


E(afc) + y^(rffc,dfc)var(afc) 


using the fact that Dmk = 1 for any k. 


D.2 Expected squared error for the sample covariance 

The following finite sample estimator of the covariance cov(x, x) = E(xa;^) — E(x)E(x)^ 

/ 

^ 1 


cov 


= Iv^ ^ “ E(a;)E(a;)^ = —^ 


= -Y 

N ^ 


At2 


EE^ 




_ ^ I 

n'^n j 


is unbiased, i.e., E(cov(x, x)) = cov(x, x). Its squared error is 

E (||cov(x,x) - cov(x,x)|||) = E ['(cov(xm, a^m') - E[cov(xm,a;m')])" 


(32) 


The m, m'-th element of the sum above is equal to 


^2 E 1 ^nmXnm' ^ ^ 

n,n' \ n"^n 


—X V 

^ '^nm / 


Xn" 


n"m' ) Xn'mXn'm' ^ ^Xn'm ^ ^ Xn"'m' 

n"'^n' 


— J\f2 ^ ^ COV {XnmXnm '; Xn'mXn'm ^) 7V^(A^ 1) ^ ^ COV | Xnm ^ ^ Xn"m'f Xn'mXn'm' 


n,n' 


n"^n 


J 1^2 ^ ^ COV I Xfim ^ ^ Xn"m' iXri'm ^ ^ Xn"'m' 
n,n' \ n"^n n"'^n' 


N‘^{N 


— Js^2 ^ ^ COV {XnmXnm' ? XnmXnm' ) 


A^2(iV- 1) 

1 

Ar2(Ar- 1)2 

1 

Ar2(iv- 1)2 


E E 


XnmXn"m' ■> XnmXnm' 


) + E E ( 


XnmXn'm' •) Xn'mXn'm' 


n n"^n 


n n'^n 


^ ^ ^ ^ ^ ^ COY (^XnmXn"m'•, XnmXn'"m') ^ ^ ^ ^ ^ ^ COV (x7T,77T,Xy},^^mS ) 

n n"^nn'"^n n' n^n' n"^n 


E E E 

n' n^n' n'"^n' 


XnmXn'm' •> Xn'mXn'"m‘ 


) ^ ^ ^ ^ ^ ^ COV (x7i,m^n”mS ) 


n' n^n' n"^n 


where we used mutual independence of the observations Xn in a sample {xn}^^i to conclude that 
the covariance between the two expressions involving only independent variables is zero. 
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Further: 


E(||cov(a:,a:) - cov(a:, a:)|||.) = ^ X! ^ - Mx-mXm')y 


7V2(iV- 1) 
2 


- 1) {E{xl,Xm')'E{Xm') - E{x„iXm')E{Xm)E{Xm')) 

n,m' 

N2(N - 1)2 E - 1)(^ - 2) - mxm)f [E(a:™')]') 

'' ' m,m' 

N{N -1){N - 2) {E{xmXm')E{xm)E{xTn') “ [^(a;™)]^ [E(xm')]^) + O 


iV2(7V- 1)2 

which after simplification gives 

E (||cov(x,a:) - cov(a::,a::)|||,) = ^ y^ ^vai{xmXm') + 2[E{xm)f vB,T{xm') 

m,m' 

+ ^ X! {‘^^i^rn)E{Xm')cOy{Xm,Xm>) - 4:E{Xm)cOv{XmX,n,,Xm,)] + O , 

m,m' ^ ' 

where in the last equality, by symmetry, the summation indexes m and m' can be exchanged. As 
Xm ^ Poisson(?/m), by the law of total expectation and law of total covariance, it follows, for 
m ^ m! (and using the auxiliary expressions from Section D.4): 

yaY{XmXm') = E(a:^X^,) - \^[XmXm']f = E [E(a:^X^,|j/)] - [E [E{XmXm'\y)]f 

~ [z/mZ/m' “b ymUm' + ymym' + ymym'\ ~ [E(?/mZ/m')] J 

[EiXm)f var{xm') = [E{ym)f E{ym') + [E(y^)]^ E(y^,) - [E(y^)]^ [E{ym')f , 

E(^Xfn)E(^Xfn'')CO'v{Xjn^ Xm'') — E(^ymym')E(^ym)E{ym') [^(Z/m)] [^(Z/m')] ; 

E(^Xm)^^^iXmXm' ^ Xm') — E(ym) [E(yrnZ/m') “b ^(Z/mZ/mO E(ymym')E(^ym')'\ ■ 

Now, considering the m = m' case, we have: 

var(a:^) =E[E(a:t|y)] - [E[E(x^| 2 /)]]^ 

= E [l/l + Qym + '^ylv + Z/m] - [E [ym + Z/m]] y 

E{Xm)E{Xm)cOv{Xm,Xm) = E{ym)'^ E(?/^) + E{ym) “ [E(ym)]^ , 

E(x™)cov(a:^,x^) =E{ym) [E{yl^) + 3E{yl^) + E{ym) - E{ym) [E(?/^) + E(j/„)]] . 
Substitution of ym = ^mkOtk gives the following 

E (||cov(a;,a::) - cov(a:,a:)|||,) =-^ y^ {dk,dk'){dk" ,dk'")Akk'k"k'" 


1 


k,k' ,k",k"' 




k,k' ,k" 


+ 


— ^ ^{dk, l)(c?/cs l)E(afcafcO + {dk,dk')J^kk' 

^ k,k' 

y](4,r)E(afe)+o(^^^, 


where 1 is the vector with all the elements equal to 1 and 

Akk'k"k'" = E{akak'ak"ak"') — E{akak")E{ak'ak'") + 2E{ak)E{ak')E{ak"ak"') 

— 2E(afc)E(afc')E(afe//)E(afc/'/) + 2E{akak")E{ak')E{ak'") — 2E{ak)E{ak')E{ak")E{ak'") 

- AE{ak)E{ak'ak"ak"') + AE{ak)E{ak'ak")E{ak"'), 

Bkk'k" = 2E{akak'ak") + 2E(afe)E(afc/)E(afc") — AE{ak)E{ak'ak") ^ 

£kk'k" = 4E(afeQ!fc/a/c//) + 6E(afe)E(afe/)E(Q!fc//) - 10E(Q!feafc/)E(afc//), 

Tkk' = 6E(afeafe/) - 5E(afe)E(afc/), 
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where we used the expressions from Section DA. 


D.3 Expected squared error of the estimator S for the GP/DICA cumulants 

As the estimator S (29) of S (10) is unbiased, its expected squared error is 


E 


115-^111 


=E 


(cov(a;,a;) — cov(a;,a:)) + ^diag[E(a:)] — diag[E(a;)]^ 

= E p|E(a:) - E(a;)|||. +E [||cov(a;,cc) - cov(a;,a;)|||.] 

~t~ 2 ^ E ^E(x^) E(x^)^ (cov(x^,x^) cov(xyj^, 

m 

As E(xyn) and cov{xm^Xm) are unbiased, the m-th element of the last sum is equal to 

COV |^E(Xrri)5 COv(Xrri, Xyn) 

= ^ - 


( 33 ) 


^2(lr _A\ E nmi ^n'm^n''m\ 

^ ' n,n',n" 7 ^n' 




^nm-) a^nmj 


A^ 2 (iV- 1 ) 


^ ^ COV [3^nm: ^n'm^nm] H“ ^ ^2 j 


n,n' 7 ^n 


1 

~N 

1 


E(X^) - I (E(x^)E(Xyyy) - [E(Xy„)]^) + O ^ 




f 1 ^ 

1 

\m 

“ N 


^{Vm) + 3E(i/yyj) + E{yjn) + 2 [E(ym)] 


O 


1 

ml ’ 


where we neglected the negative term —E(x^)E(xm) for the inequality, and the last equality follows 
from the expressions in Section D.4. Further, the fact that ym = ^mkOtk gives 

^COV E(Xy„),COT( 

5 )1 


N 


^ ^ {dk ^ dk' •) dk")Ckk'k" 

k,k',k" 

+ '^{dk,dk')E{akak') + — '^{dk, l)E(afc) + O > 

fc.fc' k ^ ^ 

where o denotes the elementwise Hadamard product and 

Ckk'k" = E{akOik'Oik") + 2E(afe)E(afe/)E(afe"). 

Plugging this and the expressions for E(||E(x) — E(x)|||h) and E(||cov(x, x) — cov{x, x)|||n) from 
Sections D.l and D.2, respectively, into (33) gives 

1 

' / 

k k 


E 


11-5-511^ 


N 


^ '^^ {dkf *^fc)^ar(Qfc) "h ^ ^ E(Q(fc) -f- ^ ^ {dk^dk'}{dk" jdk'"}^kk'k'^k" 

k,k' ,k" ,k"' 


+ 0 


1 

N 


^ ^ [{dk 1 dk'}^kk'k" 2 {dk ® dk^ j dk" 'jC'kk' k"] ^ ^ (1 -h 6 {dk ■, dk^ )) ^{(^k^k' ) ^ ^ ^ ^{^k ) 


k.k'k" 


k,k' 


where we used that, by the simplex constraint on the topics, {dk, 1) = 1 for all k. To analyze this 
expression in more details, let us now consider the GP model, i.e., ak ^ Gamma(cfc, b)\ 

30cq + 23cq + 14cq + 8co , „ 6 cq + IOcq + 4co 

/ Akk'k"k'" S - 74 -, and ^ Okk'k" S — 


k,k' ,k",k" 


k,k',k" 


63 


7co + 6co + 2co ^ 12co + IOcq + 8co 

/ tkk'k" S - 73 -, and 2_^ Ckk'k" S — 


k,k' ,k" 


k,k' 


63 


J^kk' < 

fc.fc' 


2co + co ^ ^ 2co + Co 


62 


and y^E(afeQ;fc') < 

k,k' 


62 
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where we used the expressions from Section DA, which gives 


E 


\\s-s\\l 


< 


N 


max lldfell 


2 ^0 


ly 

N 


max {dk o dk',dk") I max 
k,k',k" ’ 


h 

’’cq Co 

53 ’ 63 


( m^{dk,dk') ) max 


Co 

64’64 


+ max(cifc, dk') max 

k,k' 


Cq Co 
63’ 63 


+ 1 + max((ifc, dk>) I max 


'"co Co 
62’ 62 


o 


1 

w 


where 1 / < 30 is a universal constant. As, by the Cauchy-Schwarz inequality, max k,k' {dk,dk') < 
maxfe ||dfe ||2 =: Ai and maxfc,fc',fc"(dfc odk',dk") < maxfe |l(i/c||^ \\dk\\l < max^ \\dk\\l =■ A 2 
(note that for the topics in the simplex, A 2 < Ai as well as A2 < Ai), it follows that 


E 


|| 5 - 

2iy 1 


< 


V 

N 


Co 


+ 


L3 


+ + A 


+ ^ + 


< — ^ [AfL^ 
~ N cl'- ^ 


+ CqAiL^ + Cq^^ + Cq^] 


O 


Co 

1 


A 

A 2 ^ 


Co J 


+ 0 


1 

IP 


where cq = min(l,co) < 1 and, from Section B.2, co = bL where L is the expected document 
length. The second term co AiL 3 cannot be dominant as the system cq AiL 3 > and coAiL3 > 
A2l 4 is infeasible. Also, with the reasonable assumption that L > 1, we also have that the 4th term 
CqL < Therefore, 


E 


\\s-s\\l 


< 


'n 


■ max 


[A?L4, 


^ c-L^] 


O 



D.4 Auxiliary expressions 

As are conditionally independent given y in the DICA model (3), we have the following 

expressions by using the law of total expectation for m ^ m! and using the moments of the Poisson 
distribution with parameter 

E(xm) = E[E(a;m|2/m)] = ^{Vm), 

~ + E(j/m), 

= E[lE(a;^|ym)] = Hy?n) + 3E(y^) + E(2/^), 

E(x^) = E[E(a;t|ym)] = Hyt) + 6E(t/i) + 7E(y^) + E(i/m), 

E(xmXm') — E[E(xmXyj.j/|y)] = E[E(xm ly^/)] = E(y^y^/), 

E(xma;^,) = E[E(x„a;^,|y)] = E[E(a;m|ym)E(a;^,|y,„/)] = E{y^yl^,) +E(y™y™<)> 

= E[E(a::^|y„)E(x^,|y„/)] = E(y^y^,) + E(y^y™/) + E(y„y^,) + E(y™y„/). 


Moreover, the moments of ak ^ Gamma(cfc, 6 ) are 


E(afe) = j, 


E{al) = 


Cfc +Cfc 
62 ’ 


E(aD = 




2ck 


63 


E(a|)= 


6 c| 




6 cfc 


64 


etc. 


D.5 Analysis of the whitening and recovery error 


We can follow a similar analysis as in Appendix C of [15] to derive the topic recovery error given the 
sample estimate error. In particular, if we define the following sampling errors Es and Ex'- 

\\S-S\\<Es, 

\\f{u)-T{u)\\ < \\u\\^Et, 

then the following form of their Lemma C.2 holds for both the LDA moments and the GP/DICA 
cumulants: 


\\Wf{W^u)W^ 


WT{W^u)W^\\ < ly 


{ma,Xkjk)Es 

OK {of 


Ej' 

OK {of 


(34) 
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where crfc(-) denotes the fc-th singular value of a matrix, v is some universal constant, and in both 
cases D was defined such that S = DD^. For the LDA moments, 7 ^ = whereas for 

the GP/DICA cumulants, 7 fc takes the simpler form 7 ^ := cum(afc)/[var(afe)]^/^ = 2/y/ck. 

We note that the scaling for S is 0{L‘^) for the GP/DICA cumulants, in contrast to 0(1) for the LDA 
moments. Thus, to compare the upper bound (34) for the two types of moments, we need to put it in 
quantities which are common. In the first section of the Appendix C of [15], it was mentioned that 

(Tiv'(O) > ^LDA moments, where Cmin := min/c Ck- In contrast, for the 

GP/DICA cumulants, we can show that aK{D) > crj^(O), where L := co/b is the average 

length of a document in the GP model. Using this lower bound for the singular vector, we thus get 
the following bound in the case of the GP cumulant: 


\\wf{W^u)W^ 


WT{W^u)W^\\ < 

^min 


Es 2co 


Et Cq 
[aK{D)f 


(35) 


The factor is common for both the LDA moment and GP cumulant, but as we mentioned 
after Proposition 3.1, the sample error Es term gets divided by for the GP cumulant, as ex¬ 
pected. 


The recovery error bound in [15] is based on the bound (35), and thus by showing that the error 
Es /for the GP cumulant is lower than the Es term for the LDA moment, we expect to also 
gain a similar gain for the recovery error, as the rest of the argument is the same for both types of 
moments (see Appendix C.2, C.3 and C.4 in [15] for the completion). 


E Appendix. The LDA moments 

E.l Our notation 


The LDA moments were derived in [3]. Note that the full version of the paper with proofs appeared 
in [15] and a later version of the paper also appeared in [31]. In this section, we recall the form of the 
LDA moments using our notation. This section does not contain any novel results and is included 
for the reader’s convenience. We also refer to this section when deriving the practical expressions 
for computation of the sample estimates of the LDA moments in Appendix F.4. 


For deriving the LDA moments, a document is assumed to be composed of at least three tokens: 
L > 3. As the LDA generative model (1) is only defined conditional on the length L, this is not 
too problematic. But given that we present models in this paper which also model L, we mention 
for clarity that we can suppose that all expectations and probabilities defined below are implicitly 
conditioning on L >3." The theoretical LDA moments are derived only using the first three words 
wi, W 2 and W 3 of a document. But note that since the words Wi’s are conditionally i.i.d. given 9 (for 
1 < £ < L), we have M 3 := E(wi 0 ^ 2 ® W 3 ) = 0 Wi^ 0 we^) for any three distinct tokens 

£1, £2 and £3. The tensor M3 is thus symmetric, and could have been defined using any distinct 
£1, £2 and £3 that are less than L. To highlight this arbitrary choice and to make the links with the 
U-statistics estimator presented later, we thus use generic distinct £1, £2 and £3 in the definition of 
the LDA moments below, instead of £3 = 1, £2 = 2 and £3 = 1 as in [3]. 


"Note that another advantage of the DICA cumulants from Section 3.1 is that they do not require such a 
somewhat artificial condition: they are well-defined for any document length (even a document of length zero!). 
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Using this notation, then by the law of total expectation and the properties of the Dirichlet distribu¬ 
tion, the non-central moments'^ of the LDA model (1) take the form [3]: 


Ml 

M 2 

M 3 


E(w^J = D—, 

Co 

= —^MiM 7 -f / ^ , 

Co -f 1 Co(Co -f 1) 

0 wi^ ® wi^) 

—(g) wi^ 0 Ml) -I- ® Mi® ) + E(Mi ® wi^ ® wg^)], 

Co+ 2 


2 cg 

co(co + l)(co + 2) 


Ml® Ml® Ml + 


2 

co(co + l)(co + 2) 


K 

fc=l 


(36) 

(37) 


(38) 


where g) denotes the tensor product. 


Similarly to the GP/DICA cumulants (as discussed in Appendix C.3), moving the terms in the non¬ 
central moments (36), (37), (38), the following quantities are defined 


Co 


(Pairs) = S := M 2 --MiM, , 

^ ’ Co -f 1 ^ 


LDA S-moment 

(39) 


Co 


(Triples) =T := M 3 -- [E(?w^j ® wg^ 0 Mi) + E(rc.^j ® Mi ® wg^) + E(Mi ® wg^ ® wg^)] 


Co + 2 

2cg 


(co + l)(co + 2) 


Ml® Ml® Ml. 


LDA T-moment 


(40) 


Slightly abusing terminology, we refer to the entities S and T as the “LDA moments”. They have 
the following diagonal structure 


S = 


1 

co(co + 1) 


K 

'y ^ Ckdkdk ) 

fe=l 


T = 


2 

co(co + l)(co + 2) 


K 

y^Ckdk®dk®dk. 

k=l 


(41) 

(42) 


Note however that this form of the LDA moments has a slightly different nature than the similar 
form (11) and (13) of the GP/DICA cumulants. Indeed, the former is the result of properties of the 
Dirichlet distribution, while the latter is the result of the independence of a’s. However, one can 
think of the elements of a Dirichlet random vector as being almost independent (as, e.g., a Dirichlet 
random vector can be obtained from independent gamma variables through dividing each by their 
sum). Also, this closeness of the structures of the LDA moments and the GP cumulants can be 
explained by the closeness of the respective models as discussed in Section 2. 


E.2 Asymptotically unbiased finite sample estimators for the LDA moments 

Given realizations Wng, n = 1,..., A^, f = 1,..., Ln, of the token random variable wg, we now 
give the expressions for the finite sample estimates of S (39) and T (40) for the LDA model (and 
we rewrite them as a function of the sample counts a;„).'^ We use the notation E below to express 
a U-statistics empirical expectation over the token within a documents, uniformly averaged over the 

whole corpus. For example, E(wg^ ®wg.^®Mi) := ^ l (l -L ®wg.^ ® 

'^Note, the difference in the notation for the LDA moments in papers [3] and [4]. In [3], Mi = E(ii;^j), 
Ml = E(m;£^ 0 wg.^), and M 3 = E(w£j ® wg^ 0 wg^). However, in [4], M 2 is equivalent to S in our notation 
and to Pairs in the notation of [3]; similarly, M 3 is T in our notation or Triples in the notation of [3]. 

*^Note that because non-linear functions of Mi appear in the expression for S (43) and T (44), the estimator 
is biased, i.e., E(S') 7 ^ S. The bias is small though: IIE(S') —S|| = 0(1/N) and the estimator is asymptotically 
unbiased. This is in contrast with the estimator for the GP/DICA moments which is easily made unbiased. 
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(43) 


Ml. 


S — M 2 - —MiM' , 

Co + 1 


T-M 3 - 


Co 


Co + 2 - 
2c2 


£(■ 0 ;^^ ‘Si ‘Si Ml) + S Mi S we^) + E(Mi 0 S wi^) 


-Ml sMiS Ml, 


(co + l)(co + 2) 

where, as suggested in [4], unbiased U-statistics estimates of Mi, M 2 and M 3 are: 

N 

TV ^N 

N 

N E t„(i- - 1 ) t. 


1 A 1 


^ 1 1 
n=l ^=1 


Ml :=£(«;(.) = ^ XI TT ^ ^ 

n—1 

Ln Lr, 


M2 := E(wi!iw7J = ^ X 


X X ^"^1' 




1 tv / L„ \ 

= X['^2]n Xnxl - X ^Cnt?w7^ 

n=l \ i=l ) 

1 ^ 

= X['^ 2 ]n (at„a:7 - diag(x„)) 


^2 


(44) 


(45) 


= ^ [Xdiag(()2)^^ - diag(Xd2)] 


(46) 

(47) 


^ ^ tv L„ L„ L„ 

M3 := E(u;^^ ® (g) ut^) = — X '^sn X X X 


) w„i>2 (g) 


n=l {1 = 1 £2 = 1 £3 = 1 
£2^£i £ 3^13 
S-3+S-r 


N 


= — X ['^3]n [ atn (g> at„ 0 X„ - X '^ri£ S Wnl S Wn£ 


n—1 
L„ 


£=1 


\ 


-X X (Utn^i 0 Wn £3 0 Wni2 + Wnf^ 0 W „^2 0 '^Xn£i + Wn £3 0 Wni2 0 ^nra ) 


^1 = 1 ^2 = 1 

1 ^ 

= ]^E|t3l 


/ 


M 


n I ^ ^ 


2 ^ ^ Xnm ( 


em 09 Cm 09 emj 


MM \ 

-X X ^nmi ^ri.m2 (^mi ^ ^mi ^ Cm2 ^771^^ 0 ^777,2 ^ ^777^^ -\- Cmi ^ ^7712 ^ C7772 ) | ■ 

mi = l 7712 — 1 / 


(48) 

Here, the vectors Si, S 2 and ^3 € are defined as [(5i]^ := L~^; [ 62 ]^ ■= {Ln{Ln — 1))”^, i-e., 
n -1 

is the number of times to choose an ordered pair of tokens out of Ln tokens; 


[<52]„= (792! 


[d 3 ]„ := {LniL^ - 1 )(L„ - 2 ))-!, i.e., [d 3 ]„ = (793! 


is the number of times to choose an 


ordered triple of tokens out of tokens. Note that the vectors (5i, <52, and <53 have nothing to do 
with the Kronecker delta S. 

For a vector a G K'^, we sometimes use notation [a]„ to denote its n-th element. Similarly, for a 
matrix A G we use notation to denote its (m, n)-th element. 
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There is a slight abuse of notation in the expressions above as wn is sometimes treated as a random 
variable (i.e., in E('u;£), etc.) and sometimes as its realization. However, the difference 

is clear from the context. 


F Appendix. Practical aspects and implementation details 
F.l Whitening of S and dimensionality reduction 


The algorithms from Section 4 require the computation of a whitening matrix W of S. Due to 
the similar diagonal structure ((41) and (11)) of the matrix S for both the LDA moments (39) and 
the GP/DICA cumulants (10), the computation of a whitening matrix is exactly the same in both 
cases. 


By a whitening matrix, we mean a matrix W G (in practice, M ^ K) that does not only 

whiten S € , but also reduces its dimensionality such that'^ W= Ik- 


Let S = UTjU^ be an orthogonal eigendecomposition of the symmetric matrix S. Let Tji-k denotes 
the diagonal matrix that contains the largest K eigenvalues'^ of S on its diagonal and let Ui-k be a 
matrix with the respective eigenvalues in its columns. Then, a whitening matrix is 

W = (49) 


fl/2 

where S[.k is a diagonal matrix constructed from Si-k by taking the inverse and the square root 
of its non-zero diagonal values (i stands for the pseudo-inverse). 


In practice, when only a finite sample estimator S' of S' is available, the following finite sample 
estimator W of W can be introduced 


W := 


:iC) 


(50) 


where S = 


F.2 Computation of the finite sample estimators of the GP/DICA cumulants 


In this section, we present efficient formulas for computation of the finite sample estimate (see 
Appendix C.4 for the definition of T) of WT{v)W^ for the GP/DICA models. The construction 
of the finite sample estimator W is discussed in Appendix F.l, while the computation of S (29) is 
straightforward. 


'“'Note that such a whitening matrix W G ^ is not uniquely defined as left multiplication by any orthog¬ 
onal matrix V G does not change anything. Indeed, let W = VW, then WSW^ = VWSW^ = 

Ik- 

*^We mean the largest non-negative eigenvalues. In theory, S have to be PSD. In practice, when we deal with 
finite number of samples, respective estimate of S can have negative eigenvalues. However, for K sufficiently 
small, S should have enough positive eigenvalues. Moreover, it is standard practice to use eigenvalues of S for 
estimation of a good value of K, e.g., by thresholding all negative and close to zero eigenvalues. 
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By plugging the definition of the tensor T (30) in the formula (16) for the projection of a tensor onto 
a vector, we obtain for a given v G 



mim 2 


= ^c'u5i( 

) ^7712 5 ^7713 )’^77l3 “I” 2E S{mi,m2,m3)E{xm3)vm3 

7713 

-y^6{m2,m3)^{Xmi,Xm2)Vm3 

m 3 

-y^S{mi,m3)^{Xrm,Xm2)Vm3 

m 3 

-y^S{mi,m2)^{Xrm,Xm3)Vm3 

m 3 

= ^c'mii( 

^7711 1 ^7712 7 ^m3 )^m3 + 2S{mi,m2)E{xrm)vrm 

m 3 

- ^{Xmi,Xm2)Vm2 “ COv(Xmi, ^^2 )l’mi “ (5(mi, m2 ) ^ COv(a;mi, )t'm3 • 

m 3 


This gives the following for the expression VFT(u)fK 


T. 


WT(v)W^ = WkiT{v)Wk 2 

-I fcifc 2 


= cum( 

^TTli : ^7712 7 ^m3 )^m3 

mi ,m2 ,m3 

“t" 2 ^ ^ ^2)®^(^771 i )'^77li ^^fciTTli ^^/C2m2 


mi ,7712 

- Y ^7711 : ^7712 )^m2 ^^fciTTli ^^fc2m2 

mi ,m 2 

- E ^7711 5 ^7712 )’^mi ^^fciTTii ^^A; 2 m 2 

mi ,m 2 

- E ^7711 5 ^7773 )^m 3 

mi ,m3 


where Wk denotes the fc-th row of fF as a column vector. By further plugging in the expressions (31) 
for the unbiased finite sample estimates of cot and cum, we further get 


Vl^r(t;)W^^ 


N 


ik,k 2 {N-l){N-2) 


(Wk^,Xn - (Wk^^Xn - (v,Xn - i(x) 


^ ^(^777)^^771 ^'^fci 771 ^^fc2m 

m 

- ^ E {^k,,Xn -E{x)^ (voWk 2 ,Xn - E{x)^ 

77 

- \ ^ ^(^uofFfc^,Xn - E(a;)^ (Wk2,Xn -E(a;)^ 

77 

- ^ E ° lTfe 2 .a;n -E(a:)^ (^v,Xn - E(a;)^ , 


where o denotes the elementwise Hadamard product. Introducing the counts matrix X G 
where each element X^n is the count of the m-th word in the n-th document (note, the matrix X 
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contain the vector Xn in the n-th column), we further simplify the above expression 


WT{v)W^ = 


N 


{N -1){N -2) 

N 

' {N - 1){N -2) 

N 

{N-l){N-2) L 
2 fFdiag[u o E(a;)]fF^ 


{WX)di&g[X^v]{WX)^ 

2N{WE{x)){WE{x))^ - (WX){WX)^ 
WX{X^v){WE{x))^ +WE{x){WX{X^v))^ 


1 


iV- 1 L 
N 

N~^ 1 L 
N 

N- 1 


{WX){Wdmg{v)X)^ + {Wdmg{v)X)(WXy + Wdi&g[X{X^v)]W^ 
(W^E(x))(VFdiag[t;]E(x))^ + (fFdiag[w]E(a:))(fFE(a:))^ 
u,E(a:)^ fFdiag[E(a;)]M^^. 

(51) 


A more compact way to write down expression (51) is as follows 

N 


WT{v)W' = 


{N-1){N -2) L 


Ti + (v, E{x)) iT 2 - Ts) - (T 4 + T 4 ' ) 


1 


N-1 I 


n + -Te- + W^diag(a)W"^ 


(52) 


where 

Ti = (WX)diag[X^v](WXy, 

T 2 = 2N(WE{x))(WE{x)y, 

Ts = (WX)(WXy, 

Ti = WX{X^v)(WE{x))^, 

Ts = (WX)(Wdiag{v)X)^, 

Tq = (fKdiag(t;)E(a;))(M^E(a;))^, 
a = 2{N — l)[z; o E(a;)] + (u, E(x))E(a:) — X{X^v). 


F.3 Computational complexity of the GP/DICA T-cumulant estimator (52) 

When computing the T-cumulant P times with the formula above, the following terms are dominant: 
0{RNK) + 0{NK'^) + 0{MK), where R is the largest number of unique words (non-zero counts) 
in a document over the corpus. In practice, almost always K < M < N, which gives the overall 
complexity of P computations of the estimator (52) to be equal to 0{PRN K) + 0{PN K^). 


F.4 Computation of the finite sample estimators of the LDA moments 

In this section, we present efficient formulas for computation of the finite sample estimate (see Ap¬ 
pendix E.2 for the definition of T) of WT{v)W^ for the LDA model. Note that the construction of 
the sample estimator VL of a whitening matrix W is discussed in Appendix F. 1). The computation of 
S (43) is straightforward. This approach to efficient implementation was discussed in [4], however, 
to the best of our knowledge, the final expressions were not explicitly stated before. All derivations 
are straightforward, but quite tedious. 
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By analogy with the GP/DICA case, a projection (16) of the tensor T € ( 44 ) onto some 

vector V G in the LDA is 


T(v) 


M 


= E 


ms — l 


2ci 


J mim 2 m 3 


(co + l)(co + 2) 


Co 


M 


Co + 2 


0 10^2 0 Ml) + 0 Ml 0 10 ^ 3 ) + E(Mi 0 WI 2 0 10 ^ 3 ) 


m 3 = l 


J 


Plugging in the expression (48) for an unbiased sample estimate M 3 of M 3 , we get 


1 


N 


f{v) =]YX![^ 3 ]n [xnmiXnm 2 {Xn,v) + 2 '^d{mi,m 2 ,m 3 )Xnm 3 Vr, 


n—1 

N 


M 


- N Ei^3|» e 


+ 
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where ci, 62 , ..., Gm denote the canonical vectors of (i.e., the columns of the identity matrix 
Im)- Further, this gives the following for the expression WT{v)W^: 
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where Wk denotes the fc-th row of PF as a column-vector. This further simplifies to 

wf{v)W^ = ^(WX)di&g [(X^u) o 53 ] (WX)^ 

+ ^PFdiag [2[(X(53) ov]- X[{X^v) o , 53 ]] 1?^ 
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A more compact representation gives; 
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where 


Ti = {WX)diag [(X^i;) o ^3] (WX)^, 

T2 = Wdiag [2[(X^3) or;] - X[{X^v) o ^3]] , 

T 3 = [hFdiag(w)X]diag(53)(W"X)^, 

T4 = [W{M2v)](WMi)^ . 

F.5 Computational complexity of the LDA T-moment estimator (54) 

By analogy with Appendix F.3, the computational complexity of the T-moment is 0{RNK) -f 
0{NK^). However, in practice we noticed that the computation of (52) is slightly faster for larger 
datasets than the computation of (54) (although the code for both was equally well optimized). This 
means that the constants in 0{RNK) + 0{NK'^) for the LDA T-moment are, probably, slightly 
larger than for the GP/DICA T-cumulant. 


F.6 Estimation of the model parameters for GP/DICA model 


Below we briefly discuss the recovery of the model parameters for the GP/DICA and LDA mod¬ 
els from a joint diagonalization matrix A G estimated in Algorithm 1. This matrix has 

the property that AD should be approximately diagonal up to a permutation of the columns of D. 
The standard approach [3] of taking the pseudo-inverse of A to get an estimate of the topic ma¬ 
trix D has a problem that it does not preserve the simplex constraint of the topics (in particular, 
the non-negativity of D). Due to the space constraints, we do not discuss this issue here, but we 
observed experimentally that this can potentially significantly deteriorate performance of all mo¬ 
ment matching algorithms for topic models considered in this paper. We made an attempt to solve 
this problem by integrating the non-negativity constraint into the Jacobi-updates procedure of the 
orthogonal joint diagonalization algorithm, but the obtained results did not lead to any significant 
improvement. Therefore, in our experiments for both GP/DICA cumulants and LDA moments, we 
estimate the topic matrix by thresholding the negative values of the pseudo-inverse of A: 

dk ■■= rfcmax(0, [A'l']:fe)/|| max(0, [A’l']:fe)||i, 

where is the /c-th column of the pseudo-inverse of A, and Tk = ±1 set to —1 if [A^]:fc has 

more negative than positive values. This might not be the best option, and we leave this issue for the 
future research. 


To estimate the parameters for the prior distribution over the topic intensities ak for the DICA 
model (4), we use the diagonalized form of the projected tensor from (17) and relate it to the output 
diagonal elements Op for the p-th projection; 




cum(afc,afc,Q;fc) 

[var(afe)]3/2 


Tkdk,W Ur 


(55) 


where dfc = max(0, [A^]:fc). This formula is valid for any prior on in the DICA model. For the 
GP model (3) where ak ^ Gamma(cfe, b), we have that vai{ak) = and cum(Q;fe, ak, ak) = 
and thus tk = which enables us to estimate c^. Plugging this value of tk in (55), and solving 
for Ck gives the following expression; 
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d(dk,W^Uj, 
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By replacing the quantities on the RHS with their estimated ones, we get one estimate for Ck per 
projection. We use as our final estimate the average estimate over the projections; 




1 Pd{dk,W'u, 

~p 2^ 




^P\k 


(56) 


26 





Reusing the properties of the length of documents for the GP model as described in Appendix B.2, 
we finally use the following estimates for rate parameter b of the gamma distribution: 


where cq := ^ '^^e average document length in the corpus. 

By analogy, similar formulas for the estimation of the Dirichlet parameter c of the LDA model can 
be derived and are a straightforward extension of the expression in [3]. 

G Appendix. Complexity of algorithms and details on the experiments 
G.l Code and complexity 

Our (mostly Matlab) implementations of the diagonalization algorithms (JD, Spec, and TPM) for 
both the GP/DICA cumulants and LDA moments are available online.*^ Moreover, all datasets and 
the code for reproducing our experiments are available. To our knowledge, no efficient implemen¬ 
tation of these algorithms was available for LDA. Each experiment was run in a single thread. 

The bottleneck for the spectral, JD, and TPM algorithms is the computation of the cumu- 
lants/moments. However, the expressions (52) and (54) provide efficient formulas for fast com¬ 
putation of the GP/DICA cumulants and LDA moments (0{RNK + NK'^), where R is the largest 
number of non-zeros in the count vector x over all documents, see Appendix F.3 and F.5), which 
makes even the Matlab implementation fast for large datasets. Since all diagonalization algorithms 
(spectral, JD, TPM) perform the whitening step once, it is sufficient to compare their complexities 
by the number of times the cumulants/moments are computed. 

Spectral. The spectral algorithm estimates the cumulants/moments only once leading to 
0{NK{R -f K)) complexity and, therefore, is the fastest. 

JD. For JD, rather than estimating P cumulants/moments separately, one can jointly estimate these 
values by precomputing and reusing some terms (e.g., WX). However, the complexity is still 
0{PNK{R -f K)), although in practice it is sufficient to have P — K or even smaller. 

TPM. For TPM some parts of the cumulants/moments can also be precomputed, but as TPM nor¬ 
mally does many more iterations than P, it can be significantly slower. In general, the complexity 
of TPM can be significantly influenced by the initialization of the parameters of the algorithm. 
There are two main parameters: Ltpm is the number of random restarts within one deflation step 
and Ntpm is the maximum number of iterations for each of Ltpm random restarts (different from 
N and L). Some restarts converge very fast (in much less than Ntpm iterations), while others are 
slow. Moreover, as follows from theoretical results [4] and, as we observed in practice, the restarts 
which converge to a good solution converge fast, while slow restarts, normally, converge to a worse 
solution. Nevertheless, in the worst case, the complexity is 0{NtpmLtpmNK{R + K)). 

Note that for the experiment in Figure 1, Ltpm = 10 and Ntpm = 100 and the run with the best 
objective is chosen. We believe that these values are reasonable in a sense that they provide a good 
accuracy solution (e = 10“® for the norm of the difference of the vectors from the previous and the 
current iteration) in a little number of iterations, however, they may not be the best ones. 

JD implementation. For the orthogonal joint diagonalization algorithm, we implemented a faster 
Ch-h- version of the previous Matlab implementation’* by J.-F. Cardoso. Moreover, the orthogonal 
joint diagonalization routine can be initialized in different ways: (a) with the K x K identity matrix 
or (b) with a random orthogonal K x K matrix. We tried different options and in nearly all cases 
the algorithm converged to the same solution, implying that initialization with the identity matrix is 
sufficient. 

Whitening matrix. For the large vocabulary size M, computation of a whitening matrix can be 
expensive (in terms of both memory and time). One possible solution would be to reduce the vo¬ 
cabulary size with, e.g., TF-IDF score, which is a standard practice in the topic modeling context. 

https : //github . com/anastasia-podosinnikova/dica-light 
’'’https : / / git hub . com/anast as ia-podosinnikova/ dica 

’*http: //perso . telecom-paristech . fr/~cardoso/Algo/Joint_Diag/joint_diag_r . m 
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min 

mean 

max 

JD-GP 

148 

192 

247 

JD-LDA 

252 

284 

366 

JD(k)-GP 

157 

190 

247 

JD(k)-LDA 

264 

290 

318 

JD(f)-GP 

1628 

1846 

2058 

JD(f)-LDA 

2545 

2649 

2806 

Spec-GP 

101 

107 

111 

Spec-LDA 

107 

140 

193 

TPM-GP 

1734 

2393 

2726 

TPM-LDA 

12723 

16460 

19356 


Table 1: The running times in seconds of the algorithms from Figure 1 , corresponds to the case when N = 
50, 000. Each algorithm was run 5 times, so the times in the table display the minimum (min), mean, and 
maximum (max) time. 


Another option is using a stochastic eigendecomposition (see, e.g., [33]) to approximate the whiten¬ 
ing matrix. 

Variational inference. For variational inference, we used the code of D. Blei and modified it for the 
estimation of a non-symmetric Dirichlet prior c, which is known to be important [35]. The default 
values of the tolerance/maximum number of iterations parameters are used for variational inference. 
The computational complexity of one iteration for one document of the variational inference algo¬ 
rithm is 0{RK), where R is the number of non-zeros in the count vector for this document, which 
is then performed a significant number of times for each document. 

G.2 Runtimes of the algorithms 

In Table 1, we present the running times of the algorithms from Section 5.1. JD and JD(k) are 
significantly faster than JD(f) as expected, although the performance in terms of the £i-error is 
nearly the same for all of them. This indicates that preference should be given to the JD or JD(k) 
algorithms. 

The running time of all LDA-algorithms is higher than the one of the GP/DICA-algorithms. This 
indicates that the computational complexity of the LDA-moments is slightly higher than the one 
of the GP/DICA-cumulants (compare, e.g., the times for the spectral algorithm which almost com¬ 
pletely consist of the computation of the moments/cumulants). Moreover, the runtime of TPM-LDA 
is significantly higher (half an hour vs. several hours) than the one of TPM-GP/DICA. This can be 
explained by the fact that the LDA-moments have more noise than the GP/DICA-cumulants and, 
hence, the algorithm is slower. Interestingly, all versions of JD algorithm are not that sensitive to 
noise. 

Computation of a whitening matrix is roughly 30 sec (this time is the same for all algorithms and is 
included in the numbers above). 

G.3 Initialization of the parameter Co for the LDA moments 

The construction of the LDA moments requires the parameter cq, which is not trivial to set in the 
unsupervised setting of topic modeling, especially taking into account the complexity of the evalu¬ 
ation for topic models [16]. For the semi-synthetic experiments, the true value of co is provided to 
the algorithms. It means that the LDA moments, in this case, have access to some oracle informa¬ 
tion, which in practice is never available. For real data experiments, cq is set to the value obtained 
with variational inference. The experiments in Appendix G.4 show that this choice was somewhat 
important. However, this requires more thorough investigation. 

G.4 The LDA moments vs parameter co 

In this section, we experimentally investigate dependence of the LDA moments on the parameter cq. 
In Figure 5, the joint diagonalization algorithm with the LDA moment is compared for different 
values of cq provided to the algorithm. The data is generated similarly to Figure 2. The experiment 
indicates that the LDA moments are somewhat sensitive to the choice of cq. For example, the 
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Figure 5: Performance of the LDA moments depending on the parameter cq. D and c are learned from the 
AP dataset for K — IQ and Ff = 50 and true cq = 1. JD-GP(IO) for K = 10 and JD-GP(50) for K = 50. 
Number of sampled documents N = 20, 000. For the error bars, each dataset is resampled 5 times. Data (left): 
GP sampling; (right): LDAfix(200) sampling. Note: a smaller value of the fi-error is better. 




Figure 6: Comparison of the fi- and £ 2 - errors on the NIPS semi-synthetic dataset as in Figure 2 (top, left). 
The £2 norms of the topics were normalized to [0,1] for the computation of the I 2 error. 


recovery f'l-error doubles when moving from the correct choice cq = 1 to the plausible alternative 
Co = 0.1 for FT = 10 on the LDAfix(200) dataset (JD-LDA(IO) line on the right of Figure 5). 

G.5 Comparison of the fi- and F 2 -errors 

The sample complexity results [3] for the spectral algorithm for the LDA moments allow straightfor¬ 
ward extension to the GP/DICA cumulants, if the results from Proposition 3.1 are taken into account. 
The analysis is, however, in terms of the F 2 -norm. Therefore, in Figure 6, we provide experimental 
comparison of the £ 1 - and £ 2 -errors to verify that they are indeed behaving similarly. 

G.6 Evaluation of the real data experiments 

For the evaluation of topic recovery in the real data case, we use an approximation of the log- 
likelihood for held out documents as the metric. The approximation is computed using a Chib-style 
method as described by [16] using the implementation by the authors.'^ Importantly, this evaluation 
methods is applicable for both the LDA model as well as the GP model. Indeed, as it follows from 
Section 2 and Appendix B.l, the GP model is equivalent to the LDA model when conditioning on 
the length of a document L (with the same Ck hyper parameters), while the LDA model does not 
make any assumption on the document length. For the test log-likelihood comparison, we thus treat 
the GP model as a LDA model (we do not include the likelihood of the document length). 

G.7 More on the real data experiments 

The detailed experimental setup is as follows. Each dataset is separated into 5 training/evaluation 
pairs, where the documents for evaluation are chosen randomly and non-repetitively among the 
folds (600 documents are held out for KOS; 400 documents are held out for AP; 450 documents 
are held out for NIPS). Then, the model parameters are learned for a different number of topics. 
The evaluation of the held-out documents is performed with averaging over 5 folds. In Figure 3 and 
Figure 7, on the y-axis, the predictive log-likelihood in bits averaged per token is presented. 


*^http: //homepages . inf . ed . ac . uk/imurray2/pub/0 9etm 
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In addition to the experiments with AP and KOS in Figure 3, we demonstrate one more experiment 
with the NIPS dataset in Figure 7 (right). 

Note that, as the LDA moments require at least 3 tokens in each document, 1 document from the 
NIPS dataset and 3 documents from the AP dataset, which did not fulfill this requirement, were 
removed. 
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Figure 7: Experiments with real data. Left: the KOS dataset. Right: the NIPS dataset. Note: a higher value of 
the log-likelihood is better. 


Importantly, we observed that VI when initialized with the output of the JD-GP is consistently better 
in terms of the predictive log-likelihood. Therefore, the new algorithm can be used for more clever 
initialization of other LDA/GP inference methods. 

We also observe that the joint diagonalization algorithm for the LDA moments is worse than the 
spectral algorithm. This indicates that the diagonal structure (41) and (42) might not be present in the 
sample estimates (43) and (44) due to either model misspecification or to hnite sample complexity 
issues. 
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